```mermaid
---
config:
  theme: neutral
---
flowchart TB
  A[Member 1] --> B(Project)
  C[Member 2] --> B
```
This document presents some general advice on how to collaborate on, manage, store, and communicate about coding projects related to research.
This is general advice only! The right setup depends on the project and the goal to be achieved, and keeping some flexibility is always a good idea!
The logfile
To monitor projects, especially long-term ones, keeping a logfile is highly advised. This file records every modification made, with information such as the date, the author, and a brief explanation. It helps track the evolution of the project, its main changes, and each individual's contribution to the coding part of the research project. It is advisable to keep this file in a shared cloud repository rather than only on GitHub: the cloud's file history helps identify changes and makes retrieving a specific version of the file easier. Here is a snapshot of the logfile used for an ongoing project:
```markdown
# To-do

- [ ] Make some descriptive statistics about tenure status
- [ ] Merging and comparison with Orbis database
- [ ] Need to assess the quality of changes within the BO register
- [ ] Launch a new collection for 2025
- [ ] Outcomes about rental eviction, housing maintenance, and renters' income to compute
- [ ] Access to the TVVI database

# Previous changes

## 25.05.06

- Code for the internal workshop is ok. RL 2db4c55
- Figures are currently working. First push in a GitHub repo in the next weeks. RL 2db4c55
- Code review. RL 2db4c55

## 25.04.20

- New code to have a matching rate per percentile. RL 688eea8
- We also account for the individual shares and highlight that the 25% rule is a main limit for transparency. RL 688eea8
- New map for Paris. RL 688eea8
```
The logfile is usually written in a lightweight, open format such as plain text or Markdown, so that it remains easily readable and writable regardless of the OS.
The logfile might also contain a to-do section. Research projects are not always linear, and writing down the next steps in this file is useful. Indeed, when we reopen a project, the logfile gives a good overview of what has been achieved and what remains to be done.
Finally, sharing the logfile between contributors is key to coordinating your efforts. A glance at the logfile lets every team member see which steps are to be implemented next, while also making it easy to monitor previous achievements. Recording authorship also makes communication easier, especially when there is a misunderstanding about part of the code.
Data accessibility
The data used for the empirical analysis is key information to provide if the analysis is to be consistent and reproducible. Hence, we need to detail the following:
- Data source: where can outsiders access the data?
- Data version: for updated datasets, specify the version used in this project (especially for flow data)
- Metadata
For the metadata, it is useful to document, for any team member who might join the project, what is stored in the main datasets. Ideally, provide a clear explanation of each column: what the observation unit is, and whether the column is custom-built or derived directly from the raw data. Documenting units is worthwhile as well. Doing so is time-consuming (especially for a large dataset), but it greatly helps newcomers understand the data. If online documentation exists elsewhere, do not hesitate to link directly to that website. In that case, however, make sure your columns keep the same names as in the original data.
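As an illustration, such a data dictionary can live in the repository as a small Markdown table; the dataset and column names below are hypothetical:

```markdown
## Metadata: main_sample.csv (observation unit: firm-year)

| Column     | Description                        | Unit | Source           |
|------------|------------------------------------|------|------------------|
| firm_id    | Unique firm identifier             | -    | Raw register     |
| year       | Observation year                   | -    | Raw register     |
| rent       | Annual rent paid by the tenant     | EUR  | Raw register     |
| match_rate | Share of records matched to Orbis  | %    | Derived (custom) |
```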
README file
The README is probably the first file an outsider will open when accessing your research project. Your README file must contain important information such as:
- Title and authorship
- The main objective of the research project
- Information about how to access the data
- The main software used in the project
- The license of your code (in our case mostly open licenses, but it depends on the project)
- Any explanation that helps individuals understand your repository
The README is always written in an open file format, mainly Markdown.
An example of a README file:

```markdown
# Name of the project

## Overview

Here, we discuss the objective of the project, a snapshot of the main conclusion, and potential redirection to the paper.

## Features of the code

List the big steps of making your code accessible. For instance:

- Data collection
- Data cleaning
- Filtering of the dataset
- Descriptive statistics
- Econometric analysis

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Contact

For any questions or feedback, please contact:

Your Name:
GitHub:
```
Organizing directory and files
Besides the scripts themselves, structuring the coding directory is also good practice. First, a single script that does everything must be avoided. Research projects can be large, including data loading, data management, descriptive statistics, and econometric analysis; one standalone file would be too long to be easily understood by team members or outsiders.
On the other hand, files scattered without structure are hard to follow. Say you join an ongoing project with many files to handle: how do you know which one to execute first? Execution order is a key element of full reproducibility; if you run the econometric analysis before the filtering step, the results will be dramatically different. Hence, one file must orchestrate everything and call each subscript.
We can call this script main_script and structure the code as follows (example from an ongoing project):
```r
################################################################################
# INTREALES Project Code Preamble
################################################################################

# ------------------------------------------------------------------------------
# Project Information
# ------------------------------------------------------------------------------
# Author: Author 1, Author 2, Author 3
# Title: Code of super cool project
# Date: 2025-04-08
# Version: 1.0

# ------------------------------------------------------------------------------
# Load Necessary Libraries
# ------------------------------------------------------------------------------
# load all relevant packages for the analysis
source("init/packages.R")
theme_update(text = element_text(family = "serif"))

# ------------------------------------------------------------------------------
# Additional Setup or Configuration
# ------------------------------------------------------------------------------
output_table  <- "output_code/table/"  # location of table outputs
output_figure <- "output_code/figure/" # location of figure outputs
choice_w <- 16 # width of the output graphics (in inches)
choice_h <- 9  # height of the output graphics (in inches)

# ------------------------------------------------------------------------------
# Main Code
# ------------------------------------------------------------------------------

# Loading data -----------------------------------------------------------------
source("code/data/01_loading_data.R")
source("code/data/02_filtering_data.R")

# Descriptive statistics about the topic of interest ---------------------------
source("code/descriptive_statistics/01_summary_stat_sample.R")
source("code/descriptive_statistics/02_stat_observation_interest.R")

# Running an econometric analysis ----------------------------------------------
source("code/econometric_analysis/01_diff_in_diff.R")
source("code/econometric_analysis/02_robustness_checks.R")
source("code/econometric_analysis/03_placebo.R")
```

The same structure in Stata:

```stata
* ##############################################################################
* INTREALES Project Code Preamble
* ##############################################################################

* ------------------------------------------------------------------------------
* Project Information
* ------------------------------------------------------------------------------
* Author: Author 1, Author 2, Author 3
* Title: Code of super cool project
* Date: 2025-04-08
* Version: 1.0

* ------------------------------------------------------------------------------
* Load Necessary Libraries
* ------------------------------------------------------------------------------
* In Stata, we typically use `ssc install` or `net install` to install packages.
* For example, to install a package, you might use:
* ssc install package_name

* ------------------------------------------------------------------------------
* Additional Setup or Configuration
* ------------------------------------------------------------------------------
* Define output directories
global output_table "output_code/table/"   // location of table outputs
global output_figure "output_code/figure/" // location of figure outputs
* Define graphics dimensions
global choice_w = 16 // width of the output graphics (in inches)
global choice_h = 9  // height of the output graphics (in inches)

* ------------------------------------------------------------------------------
* Main Code
* ------------------------------------------------------------------------------

* Loading data -----------------------------------------------------------------
do "code/data/01_loading_data.do"
do "code/data/02_filtering_data.do"

* Descriptive statistics about the topic of interest ---------------------------
do "code/descriptive_statistics/01_summary_stat_sample.do"
do "code/descriptive_statistics/02_stat_observation_interest.do"

* Running an econometric analysis ----------------------------------------------
do "code/econometric_analysis/01_diff_in_diff.do"
do "code/econometric_analysis/02_robustness_checks.do"
do "code/econometric_analysis/03_placebo.do"
```
The main script is quite simple to read, regardless of whether you are an R or Stata expert. The objective is to lay out all the needed steps, with comments explaining each one. The script can be decomposed into different sections.
First, we introduce a preamble with the key information about the project: the people involved, the date, and its objective. Second, we load everything we need in one place. This avoids loading a package multiple times, makes it easy to hand the full list of required packages to another team member, and lets you quickly list package versions to ensure full reproducibility. Third, we load the data. Then, we run the analysis.
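As an illustration, `init/packages.R` could look like the following sketch (the package list is hypothetical; `pacman::p_load` installs a package only if it is missing, then loads it):

```r
# init/packages.R -- load (and, if needed, install) all project packages
# The package list below is illustrative.
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(
  data.table, # fast data manipulation
  ggplot2,    # figures
  fixest      # econometric estimations
)
```

Centralizing packages in one file means any change to the project's dependencies is visible in a single place.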
Naming files is important. When you open a coding directory, you want to understand at a glance how it is structured. That is why we advise storing your subscripts in subdirectories with explicit names such as descriptive_statistics, data, or econometric_analysis. This makes it easier to navigate the directory and to understand the code and the underlying choices (which is, in the end, the main point). Within each subdirectory, we advise numbering the coding files as a reminder of the order in which they must be executed. In the same fashion, object names within a script should be as clear as possible so that team members understand what is what (but remember that comments help too!).
Finally, we can sum up the structure of the project in a more general manner. For an R project:

- code
  - README.md
  - main_script.R
  - logfile.md
  - descriptive_statistics
    - 01_stat_code.R
    - 02_stat_code.R
  - loading_data
    - 01_loading_data.R
    - 02_filter_data.R
  - econometric_analysis
    - 01_diff_in_diff.R
    - 02_robustness.R
- output_code
  - figure
    - fig_stat_des.png
    - fig_stat_des_sample.png
  - table
    - tab_summary_stat.tex
    - tab_main_results.tex

And the Stata equivalent:

- code
  - README.md
  - main_script.do
  - logfile.md
  - descriptive_statistics
    - 01_stat_code.do
    - 02_stat_code.do
  - loading_data
    - 01_loading_data.do
    - 02_filter_data.do
  - econometric_analysis
    - 01_diff_in_diff.do
    - 02_robustness.do
- output_code
  - figure
    - fig_stat_des.png
    - fig_stat_des_sample.png
  - table
    - tab_summary_stat.tex
    - tab_main_results.tex
Here, we have two main directories (code and output_code). The second contains all figures and tables produced by the scripts, so output files are easy to find.
Storing data
GitHub is a platform to share and collaborate on code. But what about data? Generally speaking, we should avoid storing any data on it, for two reasons. First, the data may be sensitive. Second, Git is not designed for large files: GitHub rejects files over 100 MB, and large binary files quickly bloat a repository.
To store data securely, you can use the PSE NextCloud solution. Each member of the EU Tax Observatory is a member of PSE NextCloud, which offers a cloud solution for storing files, documents, or data. It is similar to common cloud services such as Dropbox, but the servers are located in Paris, within PSE. Hence, it provides a GDPR-compliant (RGPD) way to store and share data between members. You can raise the security level of a shared folder by adding passwords or time restrictions.
In R, there is a package to link your coding repository with your NextCloud data repo.
Besides where data is stored, it is important to store it in an easily readable format, such as CSV or Parquet, so that other members or replicators can open it easily. Proprietary formats such as Excel should be avoided.
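For instance, assuming the data.table and arrow packages are available, a cleaned dataset can be written in both formats as follows (file names are illustrative):

```r
library(data.table)
library(arrow)

# toy cleaned dataset
clean_data <- data.table(firm_id = 1:3, rent = c(850, 1200, 640))

# open, widely readable formats
fwrite(clean_data, "output_code/clean_data.csv")            # CSV (plain text)
write_parquet(clean_data, "output_code/clean_data.parquet") # Parquet (columnar, compressed)
```

CSV is readable everywhere; Parquet preserves column types and compresses well for large datasets.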
Code review
Finally, mistakes in coding files are common. Performing a code review from time to time is good practice to ensure that everything runs as planned. For instance, you might accidentally add a 0 to a filtering threshold and change the composition of your sample, leave in a filter that was only introduced to simplify the data during development, or comment out parts of the code that are actually useful. Code reviews help track and correct such mistakes. When writing a coding file, you do not necessarily have the hindsight to spot every issue (just as a first draft of a paper contains inconsistencies throughout).
A code review simply consists of reading the entire code sequentially and checking that everything is fine. First, make sure the code runs without errors; if it does not, correct it. Second, track any typos, such as in the filtering process or a wrong column assignment, that might affect the results. Third, you may find your code too complex: if a partial rewrite can simplify it, do it. Even if comments help explain your process, the simpler your code, the more understandable it is.
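As a hypothetical illustration of the second kind of mistake, an extra 0 in a threshold silently changes the sample:

```r
library(data.table)
dt <- data.table(id = 1:5, income = c(900, 1500, 9000, 12000, 20000))

# Intended filter: keep incomes above 1,000
intended <- dt[income > 1000]

# Typo: an extra 0 turns the threshold into 10,000 and shrinks the sample
buggy <- dt[income > 10000]

nrow(intended) # 4 observations
nrow(buggy)    # 2 observations -- a code review would catch this
```

The code runs without any error in both cases, which is exactly why such typos survive until someone rereads the script.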
Software versioning
Finally, a point that is often omitted in reproducible coding is software and package versioning. Packages and software evolve, which can break your code: a script written in 2019 might not run with current package versions. Hence, we need to specify the version of the main software (R, Python, or Stata) as well as the versions of the packages used. For R, the following code reports software and package versions.
```r
# it provides information about the R version being used
# also, it lists all packages being installed in your session
sessionInfo()
```

```
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Sequoia 15.6

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.1    fastmap_1.2.0     cli_3.6.5
 [5] tools_4.5.1       htmltools_0.5.8.1 yaml_2.3.10       rmarkdown_2.29
 [9] knitr_1.50        jsonlite_2.0.0    xfun_0.52         digest_0.6.37
[13] rlang_1.1.6       evaluate_1.0.4
```

```r
# it returns the current version of the package of interest
packageVersion("data.table")
```

```
[1] '1.17.8'
```

The Stata equivalents:

```stata
* Display Stata version information
version

* Display information about installed packages
ado dir

* To get more detailed information about a specific package, you can use:
ado describe <package_name>
```
Here, the current version of R is 4.5.1, and the version of the data.table package is 1.17.8. We need to share this information to ensure that the results are fully reproducible, even in 10 years. An alternative is to use a Docker file, which bundles all relevant files and packages needed to run your script and replicate your results.
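As a sketch of that alternative, a minimal Dockerfile based on the Rocker project could pin the R version; the package-installation lines and file names are illustrative:

```dockerfile
# Pin the exact R version used for the analysis
FROM rocker/r-ver:4.5.1

# Install the project's packages at known-good versions
RUN R -e "install.packages('remotes')" \
 && R -e "remotes::install_version('data.table', version = '1.17.8')"

# Copy the project and run the main script
COPY . /project
WORKDIR /project
CMD ["Rscript", "main_script.R"]
```

Anyone with Docker can then rebuild the exact software environment, regardless of what is installed on their own machine.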
Additional resources about coding
You can find some resources here:
- Naming Things in Code (YouTube Video)
- Why You Shouldn’t Nest Your Code (YouTube Video)
- Best Practices for Reproducible Code (Utrecht University)
Comments
Coding files can be hard to understand, especially for non-experts in the language under consideration. Moreover, data processing can be done with different packages, especially in R, which may make code hard to read for outsiders. For instance, consider the following example.
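The original example is not reproduced here; a hypothetical chunk in the same spirit, written with data.table and invented column names, might look like this:

```r
library(data.table)
dt <- fread("data/raw_sample.csv")
dt <- dt[!is.na(rent) & rent > 0]
dt[, rent_pr := rent / n_rooms]
res <- dt[, .(med_rent_pr = median(rent_pr)), by = .(city, year)][order(city, year)]
```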
If you are not an R expert, understanding such a chunk of code can be difficult, and assessing whether it contains a bug or a poor methodological choice is highly challenging. Now, let's look at the commented version of this script.
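Since the original example is not reproduced here, a hypothetical commented chunk using data.table (with invented column names) might read:

```r
library(data.table)

# Load the raw sample
dt <- fread("data/raw_sample.csv")

# Keep observations with a valid, strictly positive rent
dt <- dt[!is.na(rent) & rent > 0]

# Compute rent per room, our outcome of interest
dt[, rent_pr := rent / n_rooms]

# Median rent per room by city and year, sorted for readability
res <- dt[, .(med_rent_pr = median(rent_pr)), by = .(city, year)][order(city, year)]
```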
Comments are extremely helpful for others to understand your code. As a result, they can more easily spot any coding mistakes or poor methodological choices.
Additionally, consider revisiting an old project six months or a year later. Your coding practices may have evolved, and you might have switched packages, making it more challenging to navigate through your code. This is particularly true for lengthy code files, which can be difficult to comprehend after being away from them for several months!
However, for straightforward code, comments are not mandatory, as they can clutter the reading. Commenting code is thus a balance between clearly explaining what is performed and keeping the code as clean as possible. Performing code reviews (see the section on code review) helps adjust comments over time.